Multi-Core Ant Colony Optimization for TSP in Go

This is a continuation of Parallel Ant Colony Optimization for TSP in Standard ML, Multi-Core Ant Colony Optimization for TSP in Haskell, Multi-Core Ant Colony Optimization for TSP in Erlang, and Multi-Core Ant Colony Optimization for TSP in Scala. See first page for Ant Colony and TSP problem description. Here the program has been rewritten in the programming language Go.

Go is a new statically typed, garbage collected, concurrent programming language. It is native compiled and offers near-C performance. Like Erlang it uses a CSP model of lightweight processes ("goroutines") communicating via message passing. Messages can be synchronous (unbuffered) or asynchronous. Go supports multi-core processors.

On my usual ACO TSP test case Go is the best performing language so far. On a single core it was 2.8x slower than C, beating the previous best Shedskin Python. Go's speedup from 1 to 2 cores was 1.9, matching previous best Erlang. With 4 cores Go exceeds C's single core performance -- the first tested language to achieve this goal.

Go's lightweight processes are scheduled onto OS threads by the Go runtime system. This allows the system to support many more processes. I was able to run ACO TSP with 100,000 ant processes. The runtime system scheduler appears to treat these processes somewhat like futures, lazily scheduling them when another process is blocking on receipt of a message from a shared channel. This is inefficient for the type of parallel processing used in ACO (many seconds of parallel computation ending with a single message send), as only a single process (and thus a single core) is active at a time. This is a known issue, and the simple solution suggested on the golang-nuts mailing list is to add runtime.Gosched() calls to yield the processor. For my case this was insufficient, and adding additional heartbeat messages to each process helped force maximal multi-core usage.

Environment

All tests performed on Intel Core2 Quad CPU 2.4 GHz, 2 GB memory.
All on Debian Linux quad 2.6.22-3-686 #1 SMP Sun Feb 10 20:20:49 UTC 2008 i686

Versions

Go 8g and 8l built from repository
gcc (Debian 4.4.2-8) 4.4.2

Test Case

Iterations raised to 10,000, as times were getting too low. Still 200 cities.

Command Lines

8g ant.go; 8l ant.8
time ./8.out

gcc -Wall -O3 -s ant1_c.c -o ant1_c
time ./ant1_c
Note GCC is optimized, while Go has no optimizer yet. GCC is getting over 2x speedup from optimization; hopefully Go can improve similarly.

Code

ant.go
Leonardo Maffi's ant1_c.c

Blog

comments

Update 10/2010

I've re-run the Go ACO TSP program on a 6 core processor with hyper-threading. The Intel i7-980X supports two hardware threads per core, so the OS sees it as a 12-core processor.
Intel(R) Core(TM) i7 CPU       X 980  @ 3.33GHz
Linux hex 2.6.35-22-generic #35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010 x86_64 GNU/Linux
Ubuntu 10.10

6g ant.go; 6l ant.6
time ./6.out

This testing was paired with Real Time Ray Tracing.

Here hyper-threading has improved performance for both programs in all configurations. At one core both programs were sped up 28%. At 6 cores ray tracing was sped up 9% versus non-HT, while ACO TSP was sped up 18% versus non-HT. The higher speedup for ACO TSP is expected as it matches the general speedup from increased cores.

Both programs were run with 24 threads in all test configurations. Having more program threads than cores*hardware_threads_per_core is key. Otherwise the OS may schedule multiple threads on the same core while leaving others idle. Linux is hyper-theading aware and attempts to avoid this scheduling problem, but doesn't always get it right. In testing with 6 threads and 6 cores, I saw 28% decreases in performance for both programs when hyper-threading was enabled.

The Intel i7-980X also supports "Turbo Boost", where one core is automatically overclocked when other cores are idle. Turbo boost was disabled for the tests above. With one core and HT disabled, Turbo boost provides a speedup of 8% for both programs. Surprisingly, with six cores and HT enabled (so all cores should be fully utilized) Turbo boost provides a speedup of 5% for both programs.

The performance of Go has also improved relative to the single-threaded C version of ACO TSP. For 24 ants the C version finishes in 60 seconds, so on one core Go is only 1.7x slower than C. At 2 cores Go has exceeded the performance of the single-threaded C version. Previously Go was 2.8x slower than C on one core; I don't know whether this improvement is due to an improved Go compiler, the move to 64-bit compilers and OS, or architectural differences between the Intel Core 2 and i7-980X.